MATP-636US PATENT

## Amendments to the Specification:

Please replace the paragraph [0002] with the following rewritten paragraph [0002]:

[0002] Advances in semiconductor technology have created reasonably-priced chips with literally hundreds of millions of transistors. This transistor budget has revealed the lack of scalability of both multi-issue uni-processor architectures, such as instruction level parallelism (ILP) (superscalar and VLIW), and of the classic vector architecture. The most common use for the increased transistor budget in CPU designs has been to increase the amount of on-chip cache. The performance increase in such CPUs, however, <del>soon reached the point of diminishing returns</del> <u>is slowing down</u>. On the other hand, main applications are shifting to multimedia operation, and the chip multiprocessor is beginning to make a mark because of its exploitation of parallelism.

Please replace the paragraph [0003] with the following rewritten paragraph [0003]:

[0003] As semiconductor design rules shrank, some scaling problems began to appear. Wire delays have failed to scale. This issue has been postponed for about one silicon process generation by moving to copper interconnects and low-k dielectrics. But CPU designers already know they should no longer expect a signal to propagate completely across a standard-sized die within a single clock tic. <del>Such</del> <u>A solution of such</u> scaling problems <del>are driving CPU designers to</del> <u>is a</u> chip multi-processor[[s]]<u>, integrating small CPUs on a die</u>.

Please replace the paragraph [0004] with the following rewritten paragraph [0004]:

[0004] Another factor driving the partitioning of the single, monolithic CPU is <del>bypass logic</del> <u>excessive deep pipelining</u>. As CPU architects add more stages to their pipelines to increase speed and more instruction issues to their ILP architectures to increase instructions-per-clock, the bypass logic, that routes partial results back to earlier stages in the pipeline <del>undergoes a combinatoric explosion, which indicates that the number of pipeline stages has some optimum at a modest number of stages</del> <u>logic and complicated control logic are approaching their allowable limit</u>.

Please replace the paragraph [0103] with the following rewritten paragraph [0103]:

[0103] If willing to give up on using the full bus bandwidth for 32-bit transfers, the above may be simplified. In such an embodiment of the invention, a 32-bit write to a FIFO may be zero-filled to 64-bits and transferred as a 64-bit value. A 32-bit read from a FIFO may read the full 64 bits and only use the LSW. This coalesces the above cases. The decoder may prevent the pairing of instructions that both try to read from or both try to write to the same FIFO. Also, the decoder may prevent the next instruction from being paired with its predecessor, if that next instruction reads multiple values from the same FIFO. Deadlock is then avoided. An align instruction would not be needed. In general, an embodiment of a decoupled architecture has been described. It will be appreciated that four cross-bar DRAM clusters are also applicable to other chip-multiprocessor architectures. In addition, four cross-bar DRAM clusters are not always needed, and depend on the required resources.
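By way of illustration only, the simplified FIFO behavior and decoder pairing rules of paragraph [0103] may be modeled in software roughly as follows. This is a minimal sketch and not part of the described hardware: the names `Fifo64`, `Instr`, and `may_pair`, and the representation of an instruction as tuples of FIFO identifiers, are hypothetical and are assumed here purely for explanation.

```python
from collections import deque
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class Fifo64:
    """Hypothetical model of a FIFO whose entries are always 64 bits wide."""

    def __init__(self):
        self._entries = deque()

    def write64(self, value):
        self._entries.append(value & MASK64)

    def write32(self, value):
        # A 32-bit write is zero-filled to 64 bits and stored as a 64-bit value.
        self._entries.append(value & MASK32)

    def read64(self):
        return self._entries.popleft()

    def read32(self):
        # A 32-bit read consumes the full 64-bit entry and uses only the LSW.
        return self._entries.popleft() & MASK32


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one FIFO id per value read or written."""
    reads: tuple = ()
    writes: tuple = ()


def may_pair(first, second):
    """Return True if the decoder sketch allows `second` to pair with `first`."""
    # Both instructions try to read from the same FIFO.
    if set(first.reads) & set(second.reads):
        return False
    # Both instructions try to write to the same FIFO.
    if set(first.writes) & set(second.writes):
        return False
    # The next instruction reads multiple values from the same FIFO.
    if len(second.reads) != len(set(second.reads)):
        return False
    return True
```

In this sketch every FIFO entry is 64 bits wide regardless of the width of the access, so the 32-bit and 64-bit cases share one code path, which is the coalescing that the paragraph describes. The pairing check is deliberately conservative and rejects any pair that shares a FIFO on the read side or on the write side.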

Please replace the paragraph [0122] with the following rewritten paragraph [0122]:

[0122] The following <u>related</u> applications <del>are being</del> <u>have been</u> filed on <del>the same day as this application</del> <u>March 31, 2003</u> (each having the same inventors):

VECTOR INSTRUCTIONS COMPOSED FROM SCALAR INSTRUCTIONS assigned Application No. 10/403,241; TABLE LOOKUP INSTRUCTION FOR PROCESSORS USING TABLES IN LOCAL MEMORY assigned Application No. 10/403,209; VIRTUAL DOUBLE WIDTH ACCUMULATORS FOR VECTOR PROCESSING assigned Application No. 10/403,315; and CPU DATAPATHS AND LOCAL MEMORY THAT EXECUTES EITHER VECTOR OR SUPERSCALAR INSTRUCTIONS assigned Application No. 10/403,216.

The following related application has been filed on March 7, 2003 (having the same inventors):

LOCAL MEMORY WITH OWNERSHIP THAT IS TRANSFERABLE BETWEEN NEIGHBORING PROCESSORS assigned Application No. 10/384,198.